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ABSTRACT 

The discovery of genomic structural variants (SVs), 
such as copy number variants (CNVs), is essential to 
understand genetic variation of human populations 
and complex diseases. Over recent years, the 
advent of new high-throughput sequencing (HTS) 
platforms has opened many opportunities for SVs 
discovery, and a very promising approach consists 
in measuring the depth of coverage (DOC) of reads 
aligned to the human reference genome. At present, 
few computational methods have been developed 
for the analysis of DOC data and all of these 
methods allow to analyse only one sample at time. 
For these reasons, we developed a novel algorithm 
(JointSLM) that allows to detect common CNVs 
among individuals by analysing DOC data from 
multiple samples simultaneously. We test 
JointSLM performance on synthetic and real data 
and we show its unprecedented resolution that 
enables the detection of recurrent CNV regions as 
small as 500 bp in size. When we apply JointSLM 
to analyse chromosome one of eight genomes 
with different ancestry, we identify 3000 regions 
with recurrent CNVs of different frequency and 
size: hierarchical clustering on these regions segre- 
gates the eight individuals in two groups that reflect 
their ancestry, demonstrating the potential utility of 
JointSLM for population genetics studies. 



INTRODUCTION 

The discovery of structural variants (SVs), including copy 
number variants (CNVs) and balanced rearrangements 
such as inversions and translocations, is deeply changing 
our understanding of the human genotype (2,1). Recently, 
multiple studies have discovered an abundance of struc- 
tural variations of DNA segments that range from kilo- 
bases (kb) to megabases (Mb) in size (3). SVs have been 
found among normal individuals (4-6) while others par- 
ticipate in causing various disease states (7). For instance, 
genetic variants associated with cancer often result from 
rearrangements and alterations in proto-oncogenes or 
tumour suppressor genes (8-10), and Alzheimer and 
Parkinson's diseases have been associated with changes 
in gene dosage resulting from alterations in copy 
number (1 1,12). 

In the last decade SVs detection has been performed 
with microarray technologies. The high-density CGH 
arrays (aCGH) and SNP genotyping arrays afford a 
level of resolution that allows CNV boundaries to be 
called with relatively high precision at a genome-wide 
level. However, although microarray platforms have 
been successfully used to identify CNVs, their resolution 
is limited by either the density of the array itself (for 
aCGH) or by the density of known SNP loci (for SNP 
arrays). For instance, currently available array platforms 
that consist of more than 1 million probes have a lower 
limit of detection of ~5 to lOkb (6,13). 

Over recent years, the advent of new high-throughput 
sequencing (HTS) platforms, such as Illumina's Genome 
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Analyzer and ABFs SOLiD, have opened many oppor- 
tunities for SV discovery and has enabled initiatives such 
as the 1000 Genomes project (http://www.l000genomes 
.org) that aims to sequence the genomes of more than 
1000 individuals to extend our knowledge on human 
genetic variation. HTS technologies are able to sequence 
a full human genome per week generating milions of short 
nucleotide sequences in a single instrument run. 

The first HTS-based approach to detect SVs were based 
on paired-end read mapping (PEM), which identifies 
insertions and deletions by comparing the distance 
between mapped read pairs to the average insert size of 
the genomic library. Although this method is able to 
identify deletions smaller than 1 kb with high sensitivity, 
it does not allow the discovery of insertions larger than 
average insert size of the library and the exact borders 
of SVs in complex genomic regions rich in segmental du- 
plication (14,15). 

In this scenario, a very promising approach for the iden- 
tification of SVs using HTS technologies consists in 
measuring the depth of coverage (DOC) of reads aligned 
to the human reference genome (15). At present, few com- 
putational methods have been developed for the analysis 
of DOC data: Campbell et al. (16) use the Circular Binary 
Segmentation algorithm (17) originally developed for 
genomic hybridization microarray data, Chiang et al. 
(18) use a local change-point analysis technique, while 
Yoon et al. (19) developed a new method based on the 
significance testing that works on intervals of data points. 
Although these algorithms are very sensitive and specific 
in discovering SVs from DOC data, they allow to analyse 
only one sample at time. The simultaneous analysis of 
multiple samples can improve statistical strength in the 
identification of signals shared by the data, increasing 
the resolution of SVs detection. Moreover, the identifica- 
tion of signals shared by multiple samples can lead to the 
detection of regions of interest since disease-critical genes 
are more likely to be found in regions that are common or 
recurrent among samples. 

For these reasons, we have developed a novel algorithm, 
named JointSLM, that allows to analyse DOC signals 
from multiple samples simultaneously for the identifica- 
tion of common DNA events (recurrent CNVs) across 
individuals. By means of simulated data, we show that 
our algorithm is able to sensitively and accurately detect 
common structural variants as small as 500 bp in size. 
The comparison with other three state of the art 
methods show that our joint model allows one to obtain 
an unprecedented resolution in the detection of recurrent 
CNVs. We applied JointSLM to the DOC data of eight 
genomes and we demonstrate its unique advantage in 
population-based studies. 

MATERIALS AND METHODS 

DOC 

HTS technologies, such as Ulumina's Genome Analyzer 
and ABFs SOLiD, are able to generate milions of short 
nucleotide sequences in a single instrument run. Assuming 
the sequencing process is uniform, the number of reads 



mapping to a region follows a Poisson distribution and 
is expected to be proportional to the number of times the 
region appears in the DNA sample: a genomic region that 
has been deleted (duplicated) will have less (more) reads 
mapping to it than a region not deleted (duplicated). 

Following this assumption, the copy number of any 
genomic region can be estimated by counting the number 
of aligned reads to the reference genome. Campbell et al. 
(16) and Chiang et al. (18) were the firsts to use this 
approach to detect copy-number alterations between 
tumour and healthy samples of the same individuals, 
while more recently Yoon et al. (19) proposed to use the 
read count in sequence data to look for genomic regions 
that differ in copy number between normal individuals 
of the 1000 genomes project. 

Fhe strategy to obtain DOC data consists in counting 
the number of mapped reads in non-overlapping windows 
of fixed length and then correcting each window by GC 
content: Campbell et al. (16) and Chiang et al. (18) used 
the logarithm of the ratio between the number of aligned 
reads from a tumour sample and the match normal 
sample, while Yoon et al. (19) used the number of 
aligned reads every 100-bp, corrected for GC content 
and median normalized to copy number 2 [median nor- 
malization is defined as 2 x (read count)/(mean read count 
over the genome) for each sample]. Fhe DOC data 
obtained with this approach is mathematically very 
similar to the signal obtained from aCGH log 2 ratios. 
Deletions or duplications are identified as a decrease or 
increase in coverage across multiple consecutive windows. 
Moreover, like aCGH log 2 -ratios, DOC sequences have 
noise caused by mapping errors and random fluctuations 
in genome coverage. For these reasons, the events in DOC 
can be detected using the same types of segmentation al- 
gorithms that are used for aCGH data. 

In a recent paper (20), we developed a fast and powerful 
algorithm to segment aCGH data in which the log 2 ratios 
were modelled as a sum of two independent stochastic 
processes by means of Shifting Level Model (SLM). Due 
to their similarity with aCGH genomic profiles, also DOC 
genomic profiles can be considered to be generated by the 
sum of two processes: a biological process due to a real 
variation of the number of DNA copies and a white noise 
process that mimics experimental error. 

Here, we introduce a novel method that extend the 
SLM algorithm from the classical univariate form to a 
multivariate form to segment multiple DOC signals sim- 
ultaneously for the identification of common alterations. 
For each sample studied in this work, we take into 
consideration the logarithm in base 2 of the median- 
normalized DOC data obtained as in Yoon et al. (19): 
DOC was measured by counting the number of mapped 
reads in 100-bp windows, correcting for GC content and 
then median-normalizing to copy number 2. The 100-bp 
windows were chosen because at 30x coverage, the distri- 
bution of read counts is well approximated by a normal 
distribution, thus permitting us to assume normality in 
our mathematical modelling. DOC data from multiple 
samples were modelled as a sequential processes made of 
N observation each. 
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The multivariate form of the SLM 

We consider M sequential processes (samples) with N 
observations each (100-bp windows) and we denote with 
t (t = 1, . . . , M) and i(i= 1,. . . ,N) the respective indexes. 
We model the sequential process x = X\,...,x h ..., x N , 
where x t = (x t \, . . . ,x iM )', as the sum of two independent 
stochastic processes: 



(1 - z,_i) ■ + (z,_i) ■ (/x + 5/), 



(1) 

(2) 



where m,- = (m n , . . . ,m iM ) is the vector of the unobserved 
mean level and e, is the vector of white noises. The white 
noise vector e,- follows a multivariate normal distribution 
with mean \x e = [0] and covariance matrix E £ ; z t are 
random variables taking the values in [0, 1] with 
probabilities r\ = Pr(z t =1), 1 — r\ = Pr(z/ = 0); 5,- are 
random vectors that follow a multivariate normal distri- 
bution and li,- is the vector of the means. 

The process m t is controlled by the process z,-: when 
z,_i = 0, m t is the same as m,_i and when z,-_i = 1, m t 
takes its new value according to a multivariate Gaussian 
law with mean li and covariance matrix E e independently 
of m i _ 1 : 



et ~ N(0, E 6 ), 
mi ~ N(ji, E M ). 



(3) 
(4) 



Combining the definitions given above, the joint distri- 
bution of the observations and latent variables, given the 
parameters, has the following form: 



p(x, m, z|0) = p(x\m, E f ) ■ p(m\z, fi, E M ) ■ p(z\r]) 

N 



(5) 



~[p(m i+ i \m u z h n, E^) ■ 



(=0 



where 0 = (li, E^, E e , r|). 

Since the DOC data are modelled as the sum of two 
independent stochastic processes, the expected value of 
Xj is equal to \i and its covariance matrix is given by the 
sum of the covariances of the two processes: 



E[xi] = ji, 



E = E^ + E 6 . 



(7) 



Using (7) we can introduce a different parametrization of 
the SLM by defining the parameter <d such that E M = co ■ E 
and E e = (1 co) ■ E. 

When we deal with multiple DOC signals (profiles) 
simultaneously, we have to take into account some 
fundamental considerations: (i) each DOC profile is 
characterized by its technical noise caused by mapping 
errors and random fluctuations in genome coverage and 
(ii) in each DOC profile, a CNV can be present at variable 
copy number in comparison with a reference genome. 
For these reasons, the white noise distributions and 



mean level distributions can be considered independent 
across samples and we can write: 



M 



N(0, Z € ) = Y[N(0,a £ j), 
t=\ 

M 



(8) 
(9) 



where cr M and a M are the standard deviations of the 
normal distribution of the /-th sequential process for the 
white noise and the mean level distributions, respectively. 

With these assumptions, from a mathematical point 
of view, the random process z, is the only variable that 
correlates the samples. When z,- changes its value, the 
mean level of each sample have a shift. In this way, our 
joint model is able to detect common shift in the mean of 
multiple samples. 

The JointSLM algorithm 

The joint distribution of Equation (5) defines an 
Hidden Markov Model (HMM) of order one with state 
variable q t = (m h z,) and multivariate emission probability 
(see Supplementary Data). The fact that the multivariate 
SLM is an HMM allows us to make use of the several 
algorithms developed for these kinds of models. 

To estimate the parameters of the Multivariate SLM, 
we develop a two step algorithm that follows the idea of 
(20), based on dynamic programming. The inputs to the 
algorithm are the sequences x = . . . , x M } to be jointly 
segmented, the initial estimate of the number of states K 0 
and the parameters co and r\. In the first step, we estimate 
the parameters by means of the Baum and Welch 
re-estimation strategy (21), while in the second step we 
estimate the best state sequence s (the z,- variables, i.e. 
the points of mean shift) by means of the Viterbi algo- 
rithm. Finally, we convert the data from log space to 
copy number space and we calculate the median of the 
data that belong to each segment. A detailed description 
of the algorithm and the study of the effect of the param- 
eters K Q , co and r\ on the performance of JointSLM are 
reported in Supplementary Data. 



RESULTS 



(6) Synthetic data analysis 



To assess the performance of JointSLM algorithm in iden- 
tifying common DNA events of different size, we made an 
intensive simulation based on synthetic data generated 
from the GC-adjusted DOC data of chromosomes 1 and 
X of the male individual NA 18507. To estimate specificity, 
we generated synthetic chromosomes by sampling 10000 
100-bp windows from chromosome 1 to simulate normal 
copy number. To estimate sensitivity, we added to the 
normal copy number chromosomes nine deletions of size 
200 bp, 300 bp, 400 bp, 500 bp, 700 bp, 1 kb, 2.5 kb, 5kb 
and 10 kb sampled from chromosome X. To minimize 
the sampling of random false positives we removed all 
gaps, segmental duplications, telomeres/centromeres and 
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regions with known CNVs from the Database of Genomic 
Variants (DGV, http://projects.tcag.ca/variation/) and the 
Genome Structural Variation Consortium (GSV, http:// 
www . sanger . ac.uk/humgen /cnv /42mio/) . 

We applied JointSLM with different parameter settings 
on data sets made of 10, 30 and 50 normal copy number 
chromosomes, and we evaluated false positive rate (FPR) 
by counting the number of detected alterations. The spe- 
cificity of the algorithm can be controlled by the param- 
eters T) and 03 (see Supplementary Data): the higher are the 
values of r| and co and the larger is the number of detected 
false positive (FP) events. For instance, in the 10 samples 
analysis (Figure lb), when we set r\ = 10~ 3 and co = 0.3 
we detect an average of 9.4 FP events that range between 
100 and 500 bp in size (6.0 events of 100 bp, 2.6 of 200 bp, 
0.62 of 300 bp, 0.12 of 400 bp and 0.08 of 500 bp), while 
using a more conservative set of parameters (r\ = 10~ 6 and 
co = 0.1), we detect an average of 0.91 FP events (0.66 of 
100 bp, 0.22 of 200 bp, 0.02 of 300 bp and 0.01 of 400 bp). 
In the analysis of the 30 samples data set (Supplementary 
Figure), we detected an average of 21.6 FP events (FPR = 
0.2%) with co = 0.3 and r\ = 10~ 3 and an average of 9.6 
(0.1%) events with co = 0.1 and r\ = 10" 6 , while for the 50 
sample analysis the FPR grows to 0.3% with co = 0.3 and 
T) = 10~ 3 and to 0.2% with co = 0.1 and r\ = 10" 6 . These 
results show that the use of the most conservative set of 
parameters allows us to obtain a global FPR smaller than 
0.01%. 

Moreover, the great majority of the FP detected by 
our method in the 30 and 50 samples data set is made of 
100 bp events, and in all the cases JointSLM does not 
identify FP events larger than 400 bp in size. 

To quantify the detection power of our algorithm, we 
applied JointSLM with different parameter settings 
on simulated data sets made of 10, 30 and 50 synthetic 
chromosomes with common deletions inserted in a 
fraction of samples / (with / that ranges between 0.1 
to 1) and we estimated TPR as the fraction of correctly 
detected alterations. The results of these simulations 
(Figure la and Supplementary Figures SI and S2) show 
that the resolution of the algorithm (i.e. the ability of 
identifying regions of different size) does not depend on 
the number of samples analysed simultaneously but is 
strongly dependent on the fraction of altered samples / 

When /is small (i.e. only 10 or 20% of the samples are 
altered), JointSLM is able to correctly locate only regions 
larger than 1 kb in size, while for higher values of / (larger 
than 50%) the resolution of the algorithm drastically in- 
creases, allowing the identification of very small alter- 
ations (smaller than 1 kb). By setting co = 0.1 and 
r| = 10~ 6 we were able to correctly detect regions greater 
than 1 kb and shared in more than 20% of the samples, 
while when we set the parameters to less conservative 
values (co to 0.2/0.3 and r\ set to 10" 4 / 10" 3 ), we 
observed a dramatic improvement in detecting small alter- 
ations: in these cases, JointSLM is able to correctly detect 
common alterations smaller than 500 bp and shared 
among the 20% of the samples. 

In order to evaluate the ability of our algorithm in 
correctly detecting the boundaries of common DNA 
events (breakpoints problem), we generated synthetic 



chromosomes in which common deletions are not perfect- 
ly aligned but randomly shifted of n 100 bp windows (with 
n that ranges from 1 to 5). The resolution of the algorithm 
is not affected by these perturbations: also in this case we 
were able to detect genomic events larger than 1 kb in size 
with co = 0.1 and r\ = 10~ 6 and smaller than 1 kb with 
co = 0.3 and r\ = 10~ 3 (see Supplementary Figures S3-S5). 

The extensive simulation study we performed show that 
the parameter co allows to control both sensitivity and 
specificity, while r\ is able to control only specificity and 
has weak effect on sensitivity: these results suggest to 
use very conservative values of r\ (10~ 5 / 10~ 6 ) in order 
to contain FP detection and tuning co to obtain the desired 
level of sensitivity. 

Comparison with other algorithms 

To demonstrate the advantages of analysing multiple 
samples at once by means of our joint model instead of 
using single sample models, we compared the performance 
of JointSLM with other three algorithms: the CBS (17) 
and EWT (19) methods that have been already used 
in the analyses of DOC data and the GLAD method 
(22) previously used for the analysis of array CGH data. 
To this end, we applied the three methods with default 
parameter settings to the synthetic data sets of the 
previous paragraph and we calculated the TPR as the 
fraction of correctly detected alterations and the FPR as 
the average number of FP detected in each chromosome. 
To call gain and losses with CBS algorithm, we used the 
same thresholds used for the JointSLM algorithm (see 
Supplementary Data). The results of these analyses and 
the comparison with JointSLM performance are detailed 
in Figure 2. 

All algorithms perform well in terms of specificity: they 
detect a very small number of FB events and all the FP 
identified are smaller than 500 bp in size. In terms of sen- 
sitivity, our joint model outperforms the other single 
sample algorithms: JointSLM is able to detect very small 
alterations (200 bp) with a TPR larger than 0.8 showing 
an unprecedented sensitivity in detecting CNVs, while the 
other methods allow to detect only events larger than 500 
bp. A more detailed study of Figure 2 shows that the 
number of FP events detected by JointSLM decreases 
when the number of samples analysed at once increases. 
This is probably the most interesting feature of our algo- 
rithm: analysing a large number of samples improves spe- 
cificity without affecting sensitivity. 

These results clearly suggests that the simultaneous 
analysis of multiple samples with our joint model 
improves the statistical strength in the identification of 
small CNVs and the use of JointSLM algorithm allows 
to extend the detection power of CNVs. 

Real data analysis 

In order to identify common CNVs among multiple 
individuals, we applied JointSLM to the DOC data of 
eight genomes. These included a CEU trio of European 
ancestry (NA12878, NA12891 and NA12892), a YRI trio 
of Yoruba Nigerian ethnicity (NA19238, NA19239 and 
NA 19240) that belong to 1000 Genomes project and two 
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Figure 1. TPR and FPR estimate for different values of r| and to on synthetic data made of 10 chromosomes. Each point of the plot is obtained by 
averaging the JointSLM results over 100 repeated simulations, (a) Each curve represents the TPR estimate against deletion events of different size. 
In each plot are reported the curves for different values of fraction of altered samples / (with / that ranges between 0.1 and 1). (b) Each curve 
represent the FPR estimate against the size of false detected events. 



additional published genomes, including a Yoruban indi- 
vidual NA18507 (23) and a Chinese individual YH (24). 

To minimize type-I error and obtain a very robust set 
of CNVs, we ran JointSLM using a conservative set of 
parameters (K Q = 20, r| = 10~ 6 ra = 0.1), and we identified 



a total of 3000 CNV regions in chromosome 1 (for some 
examples of the JointSLM segmentation see Supple- 
mentary Figures S6-S9): 820 (27%) are smaller than 
500 bp, 1131 (38%) ranges between 500 and 1000 bp, 760 
(25%) ranges between 1 and 5 kb and 289 (10%) are larger 
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Figure 2. TPR and FPR for JointSLM, EWT, CBS and GLAD on the synthetic chromosomes data sets. TPR is calculated as the average fraction 
of correctly detected alterations in each chromosome and the FPR as the average number of FP detected in each chromosome. For JointSLM, we 
report the results obtained in simulated datasets made of 10, 30 and 50 synthetic chromosomes. 



than 5kb (Table 1). All the CNVs detected in this analysis 
are listed in Supplementary Table SI. 

Of all these regions, 958 (32%) are shared by only one 
sample, 457 (15%) by two samples, 330 (11%) by three 
samples, 233 (8%) by four samples, 165 (5%) by five sam- 
ples, 155 (5%) by six samples, 168 (6%) by seven samples 
and 534 (18%) are present in all of the eight samples. 
According to Nguyen et al. (25), we found that the CNV 
regions identified by our algorithm are significantly 
overrepresented close to telomeres and centromeres 
(Supplementary Figure S10). Additionally, 799 of the 
2180 RefSeq genes of chromosome 1 are contained or 
overlap with our set of called regions. 

In order to validate the genomic events detected by our 
algorithm, we compared our calls with the known CNVs 
in DGV version 10 and each call was considered validated 
if there is any overlap of 1 bp or greater. The global val- 
idation rate is about 58%, and around 43% (3255/7666) 
of known CNVs were found in our call set. When we 
consider called regions that ranges between 1 and 5kb 
the validation rate is around 70-80%, and goes up to 
95-100% for genomic events greater than 5kb. On the 
other hands, when we take into consideration CNVs 
smaller than 1 kb, the validation rate ranges between 
40% and 60% (see Supplementary Data for more details). 

As a further test, we compared our set of calls with a set 
of common CNVs recently assessed by GSV Consortium 
using high resolution array-CGH platforms. The common 
CNVs were detected in 40 individuals (20 CEU Caucasian 
and 20 Yoruban samples) by means of a NimbleGen tiling 
array set of 42 million probes and include 748 CNV 
regions for chromosome 1. We found that 25% of the 
CNVs identified by JointSLM overlap with the GSV 



calls (Table 1) and around 50% of the GSV calls were 
present in our callset: for regions larger than 5kb 
we found that the overlap with GSV regions is around 
70% (70% for both regions that ranges between 5 and 
10 kb and regions larger than 10 kb), while for CNVs 
smaller than 5kb it reduces to 10-30% (35% for regions 
that ranges between 1 and 5kb and 10% for regions 
smaller than 1 kb). 

Lastly, we compared our set of calls with SVs detected 
by PEM-based approach. The SVs of two of the indi- 
viduals considered in this study (YH and NA18507) 
were previously analysed by means of PEM-based 
approach (23,24). To understand the differences between 
JointSLM and PEM-based methods in detecting known 
CNVs, we took the set of copy number variants of GSV as 
a set of true positive (TP), and we determined the propor- 
tion of TP identified by the two approaches. In the 
samples NA18507 and YH, JointSLM was able to 
identify 290 (39%) and 256 (34%) of the 748 CNV 
regions of the validation set, while PEM-based methods 
detected 125 (17%) and 79 (7%) (see Figure 3). 

These results show that our algorithm has good sensi- 
tivity with respect to PEM methods in identifying 
CNVs previously detected by array-CGH. However, 
there is a little overlap between our call set and the call 
sets obtained with PEM approaches: for NA18507 we 
found an overlap of 18% (184/996) and for individual 
YH an overlap of 23% (46/194). 

To understand if the discrepancy between PEM and 
our calls is due to detection limits of our algorithm, 
we calculated the median value of the DOC data for 
each non-overlapping region identified by PEM-based 
methods for both YH and NA18507. We found that 
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Table 1. Summary statistics for the CNVs detected by JointSLM on chromosome 1 



Number of samples that 
share the alterations 


100-500 bp 


500-1000 bp 


1-5 kb 


5-10 kb 


>10kb 


1 


142 (53%/ 19%) 


458 (33%/ 9%) 


318 (55%/ 23%) 


26 (100%/ 77%) 


14 (100%/ 86%) 


2 


95 (59%/ 23%) 


221 (43%/ 11%) 


107 (73%/48%) 


14 (100%/ 71%) 


20 (100%/ 95%) 


3 


109 (49%/ 20%) 


117 (45%/ 15%) 


91 (80%/41%) 


10 (100%/ 80%) 


3 (100%/ 100%) 


4 


77 (58%/ 19%) 


98 (48%/ 16%) 


48 (79%/42%) 


8 (100%/ 88%) 


2 (100%/ 50%) 


5 


39 (51%/28%) 


66 (48%/ 6%) 


45 (87%/ 40%) 


7 (100%/ 86%) 


8 (100%/ 100%) 


6 


56 (59%/ 29%) 


53 (57%/ 8%) 


33 (79%/ 48%) 


5 (100%/ 60%) 


8 (100%/ 75%) 


7 


75 (55%/ 20%) 


45 (51%/ 7%) 


29 (86%/ 38%) 


9 (100%/ 67%) 


10 (80%/ 40%) 


8 


227 (51%/ 14%) 


73 (55%/ 18%) 


89 (87%/45%) 


47 (96%/ 64%) 


98 (95%/ 62%) 


Total 


820 (53%/ 20%) 


1131 (42%/ 11%) 


760 (70%/ 35%) 


126 (98%/ 71%) 


163 (96%/ 70%) 



The number of CNVs detected by JointSLM are listed separately for different sizes and number of samples that share the alteration. In brackets are 
reported the proportion of JointSLM calls that overlap (by at least 1 bp) with CNV regions in the Database of Genomic Variants (before the /) and 
in the GSV validation call set (after the /). 



YH NA18507 



PEM DOC PEM DOC 




GSV GSV 

Figure 3. Venn diagram of the comparison between the regions called 
by JointSLM, PEM-based methods and by the GSV Consortium. 



Table 2. Summary statistics for the seven CNV clusters identified by 
Ward's hierarchical clustering 



Cluster 


N 


Size (bp) 


SD (%) 


SR (%) 


RefSeq (% 


) Class 


A 


434 


2 228566 


73 


16 


34 


Amp 


B 


653 


956 547 


29 


5.5 


44 


Del 


C 


545 


1 112 655 


63 


6.1 


40 


Amp 


D 


683 


1 192517 


31 


8.1 


61 


Del 


L 


183 


718817 


54 


6.5 


34 


Amp 


1 


242 


1 132 458 


23 


6.8 


46 


Del 


G 


260 


300 840 


19 


7.5 


53 


NA18507 



For each cluster we listed the total number of regions Amp, 
Amplification; Del, Deletion. (TV) and the total size in bp. We also 
reported the overlap between called regions and segmental duplications 
(SD), simple repeats (SR) and with RefSeq genes (RefSeq). 



94% (811/856) and 82% (122/148) of the non-overlapping 
regions identified by PEM methods in NA18507 and YH, 
respectively, have a median value that ranges between 1.2 
and 2.8 copies. These results suggest that the differences 
between PEM and our calls are not due to detection 
limits of the JointSLM algorithm but to the fact that the 
PEM- and DOC-based approaches allow to detect differ- 
ent classes of SVs: the discrepancy between PEM- and 
DOC-specific events has been previously reported (19) 
and it is explained by the fact that DOC-specific events 
show a large overlap with annotated segmental 
duplications (SDs), while PEM-specific events show an 
enrichment with simple repeat (SR) events. 

JointSLM and clustering 

To demonstrate the utility of our algorithm for population 
genetic analysis, we applied clustering analysis to the 
matrix of the CNV regions identified by JointSLM in 
chromosome 1. We performed Ward's hierarchical clus- 
tering with the aim to group both CNV regions and indi- 
viduals. We used Pearson correlation coefficient for 
clustering individuals and the euclidean distance for clus- 
tering genomic events. 

Table 2 and Figure 4 report the results of the hierarch- 
ical clustering. Although no information on the identity of 



the individuals was used in the analysis, the algorithm was 
able to segregate the ancestry of the eight individuals in 
two main clusters: the first cluster include the european 
ancestry family and the Chinese individual, while the 
second cluster include the nigerian ancestry family and 
the Yoruban individual NA18507. The clustering on the 
genomic events identified seven groups of regions with 
complex patterns of CNVs. In particular, we were able 
to detect three clusters (A, C and E) that contain regions 
with common amplifications, three clusters (B, D and F) 
that contain regions with common deletions and a cluster 
(G) that is made of deletions present only in the individual 
NA18507. 

The genomic regions grouped in cluster A, E and F 
contain CNVs with high population frequency (shared 
among almost all of the eight individuals), while clusters 
B, C and D contain subgroups of CNVs that are primarily 
shared among Yoruban ancestry or european ancestry. 
As expected, the three clusters with common amplifica- 
tions showed a greater enrichment of annotated SDs 
compared with the clusters that contain deletions. SDs ac- 
counted for 73, 63 and 54% of the total base pairs of 
clusters A, C and E, while for clusters B, D and F we 
found an overlap of 29, 31 and 23%, respectively. 

Conversely, we observed that the clusters that contain 
common deletions (B, D and F) showed a greater 
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number of 
copies 




Figure 4. Hierarchical clustering on the estimated copy number of the 
3000 CNV regions detected by JointSLM on chromosome 1 with par- 
ameters r| = 10~ 6 , to = 0.1 and K 0 = 20. Each row represents a separate 
CNVs region and each column a separate individual. The coloured bars 
on the right of the figure represent clusters of genomic events that share 
similar CNV patterns over multiple individuals. 



enrichment of annotated genes compared with the ampli- 
fication clusters. The number of RefSeq genes that overlap 
with clusters A, B and C are 128, 151 and 43, while for the 
deletion clusters D, E and F we found an overlap with 236, 
360 and 117 RefSeq genes, respectively. 

We also studied the overlap between the CNV regions 
that belong to each of the seven clusters and annotated 
regions with SRs: we found that all the clusters have an 
average overlap of the order of 6-8% with SR events, with 
the exception of cluster A that have more than 15%. 
Clusters A contain 434 amplification regions with very 
complex pattern of CNVs: around 65% of the amplifica- 
tions that belong to cluster A have an estimated number of 
copies greater than 4. 



DISCUSSION 

We developed a novel algorithm that extend the univariate 
SLM to the multivariate case in order to detect recurrent 
shifts in the mean of multiple sequential processes. We 
applied JointSLM to DOC signals obtained from high 
coverage sequencing data in order to infer common 
CNVs among multiple individuals. 

The results obtained in simulated chromosomes show 
that JointSLM correctly detects recurrent CNV regions 
as small as 500 bp in size with sensitivity larger than 
90%. The comparison with other state of the art 
methods demonstrated the our joint model is able to 
obtain an unprecedented resolution in the analysis of 
DOC data (see Supplementary Data). We applied our al- 
gorithm on chromosome 1 of eight genomes and we 
identified 3000 regions with recurrent CNVs of different 
frequency and size. We validated in silico the 3000 CNVs 
regions by studying the overlap with the annotated CNVs 
of the Database of Genomic Variants. We found that 
more than 50% of the inferred regions overlap with 
annotated CNVs and the validation rate grows up to 
70-100% for regions larger than 1 kb. These results 
clearly show that the use of DOC data combined with 
our algorithm allows to obtain an unprecedented reso- 
lution in the identification of genomic structural variants. 

We also demonstrate the utility of JointSLM algorithm 
for population genetics analysis by applying cluster 
analysis to the inferred regions. Hierarchical clustering 
applied to the samples is able to separate the eight indi- 
viduals in two main clusters that reflects their ancestry, 
while cluster analysis on the regions allows to identify 
groups of CNVs that share structural features such as en- 
richment in segmental duplications, enrichment in simple 
repeats and gene content. 

JointSLM is also able to analyse DOC data from one 
sample at time obtaining very good results in terms of 
sensitivity and specificity (see Supplementary Data). 

The resolution of JointSLM strictly depends on the 
signal to noise ratio (SNR) of the data (see 'Materials 
and Methods' section): increasing the SNR of DOC data 
by reducing the sequencing error rate or augmenting the 
coverage of the sequencing experiments, will improve the 
performance of JointSLM in detecting small shifts in 
the signals. 
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The JointSLM algorithm can be also used to analyse 
multiple tumour samples data for the discovery of recur- 
rent copy number alterations. In this case, to estimate 
DNA copy numbers it is necessary to take into account 
cellularity and tumoural heterogeneity and for this reason 
we would need a more sophisticated approach similar to 
CGHCall (26) or FastCall (27) instead of using the simple 
rounding to the closest integer. Although all the analyses 
performed in this article were made on sequence data from 
the Illumina Genome Analyzer, the algorithm we de- 
veloped is generic and can be used to analyse DOC data 
produced by different HTS platforms. 

AVAILABILITY 

The JointSLM algorithm is implemented as an R package. 
The source code includes both R and Fortran codes. 
The JointSLM package and a brief manual is freely avail- 
able as Supplementary Data. 

SUPPLEMENTARY DATA 

Supplementary Data are available at NAR Online. 
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